Many multi-value returns #1147
Many multi-value returns #1147
Conversation
When a value is stored into a stack slot, it is normally done using |
These two lines in the log:
mean that the solver (during the coloring phase of regalloc) is making room so that v4 can be in rax, which is the ABI return register (when only one value is returned); then v0 (which already lives in rax) must be diverted, so it must be added as a solver variable so rax is available for v4. Is it possible instead to change the ABI location requirements of return values in |
Yeah, the tricky thing that is different from |
But the thing is Basically, the problem is that I don't know why (a) updating the
I can look into it / thought I was doing that by setting the values' Regarding specifically creating the stack slot when looking at return values, I don't think that can work since the stack slot is in the caller's frame, not the callee's frame. |
I think this is the proper solution, as you would need a separate machine instruction to store the result anyway in many cases and every machine instruction corresponds to a clif instruction. (While the conditional trap instructions map to multiple machine instructions, they are an exceptional case.) |
// Make room for the maximum amount of incoming return pointer space we'll | ||
// need. | ||
// | ||
// TODO FITZGEN: alignment? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a comment re. alignment requirements of this return region on the stack, or if this affects alignment of the computed layout? (if the latter, the layout is currently always made to leave the stack aligned)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is regarding the alignment required by the return values. For example, if for some reason there is an i32
stack slot, and then we need another stack slot for 8-byte-aligned data, we would need to add padding after the i32
slot (or rearrange them) and I'm just not sure yet how much of this infra already exists and is done for me automatically or not, and how to express these requirements if it does exist.
@@ -56,6 +58,14 @@ pub fn layout_stack(frame: &mut StackSlots, alignment: StackSize) -> CodegenResu | |||
.ok_or(CodegenError::ImplLimitExceeded)?; | |||
outgoing_max = max(outgoing_max, offset); | |||
} | |||
StackSlotKind::RetPtr => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like a kind of silly question: where are StackSlot
with kind RetPtr
added? I'd expect iterating all calls in a function and adding the necessary slots from that, but don't see anything of the sort.
edit: I think I see the relevant TODO
in legalizer/boundary.rs :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a followup, and I think there ought to be a test case (or plural!) around this, if we're reusing the largest RetPtr
slot for different multi-return calls with spills, I'd like to be sure that we know this stack slot is clobbered on reuse. That way Cranelift can spill out of this slot into somewhere that return values can be safely read from. eg v0
should be moved out of the stack-returns area in this clif, which I think is pretty close to valid:
function %retptr_reuse() fast {
sig0 = () -> i32, i32, i32, i32, i32, i32 fast
sig1 = () -> f32, f32, f32, f32, f32, f32 fast
ireturns = u0:0 sig0
freturns = u0:1 sig1
ebb0():
v0, v1, v2, v3, v4, v5 = call icalls()
v6, v7, v8, v9, v10, v11 = call fcalls()
; I'm concerned v6 will overwrite v0, as things are currently
}
I don't know if the notion of stack slots being reused for different purposes is even something Cranelift knows about right know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't really started on implementing the caller side yet; sorry I should have been more clear in the op.
Yeah I think it is worth not attempting to reuse these stack slots at all in the first pass implementation, and then circling back to add the reuse as a new PR.
let (sig_ref, args) = match dfg[inst].analyze_call(&dfg.value_lists) { | ||
CallInfo::Direct(func, args) => (dfg.ext_funcs[func].signature, args), | ||
let (sig_ref, args) = match func.dfg[inst].analyze_call(&func.dfg.value_lists) { | ||
CallInfo::Direct(f, args) => (func.dfg.ext_funcs[f].signature, args), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small name comment, target
or dest
instead of f
?
cranelift-codegen/src/isa/x86/abi.rs
Outdated
ret_ptr_param.location = loc; | ||
sig.params.insert(0, ret_ptr_param); | ||
} | ||
ArgAction::Convert(_) => unreachable!(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it correct to phrase this as unreachable!("`pointer_type` shouldn't require conversion")
?
Ok, I re-read it and read your code too. v0 is not dead, since it's referenced in the return instruction. Does it work when you only keep the last value (i.e. the ret struct point) in the IR return instruction, when legalizing the signature? There's apparently no need to keep the other values live in the ret instruction, since they're manually moved to their final location before the actual ret. |
Thanks for the feedback and tips @bjorn3, @iximeow, and @bnjbvr!
I think this actually makes more sense than the I'll try legalizing this signature
into
and legalizing returns into stores, as before, except with only the single return retptr value
and finally legalizing calls like this
This feels very promising and I think I can actually remove a lot of the extra Looking back, I don't know why I was so determined to keep the multiple return values in the signature, instead of letting them dissolve away into the loads and stores like they actually do at the machine level... |
13abc6c
to
cb6bede
Compare
It could help with a consumer generating debuginfo for return values from the post compilation ir. |
ef4bf0a
to
19b884f
Compare
Ok! I have all tests passing plus a bunch of new tests specifically for multi-value stuff :) I think this is ready for review! r? @bnjbvr |
57db582
to
6a9809b
Compare
6a9809b
to
c1ff3e4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is on the right path, great work.
- Re: the commit message of
Support arbitrary numbers of return values
, could you remove the examples please? The test cases provide enough context for what's going on imo. - Can you add tests for
call_indirect
, please? - This one is trickier, because i'm not sure we have enough runtime support in the
run
test framework for this (in particular with respect to linking), but: it would be extra nice to haverun
tests for functions returning multiple values and checking that the returned values are correct. - Have you considered de-grouping return values, that is, instead of having a single
RetPtr
stack slot, have severalOutgoingRet
stack slots? I wonder if this wouldn't make a few things simpler, like, no need to precompute the size of the entire ret stack space, so the two big loops when legalizing a call instruction could be fused.
Yeah, the |
In order to pass in a single |
…of-a-power-of-2 function
Instead, let the existing frame layout algorithm decide where they should go.
b8ef407
to
f03ee7e
Compare
Do not merge! This is a work-in-progress!EDIT: ready for review now :)I'm working on supporting arbitrarily many return values (although we will need to enforce some reasonable upper bound) in cranelift functions, and therefore in Wasm as well.
The approach I've been trying is to:
Make use of the
sret
special argument for passing in a return pointer, pointing to space allocated in the caller's stack frame for the return values.Introduce new
ArgumentLocation
andValueLocation
variants for essentially "offsetx
from this function'ssret
parameter":*(sret + x)
.During legalization, we rewrite signatures, calls, and returns to introduce usage of the
sret
return pointer parameter. Specifically,legalize_signature
will add thesret
parameter and assign offsets for each return value (currently implemented),handle_call_abi
will rewrite multi-value calls to allocate a stack slot for the return values, pass a pointer to that space as thesret
argument, and then load the return values out of the stack slot (not yet implemented),and
handle_return_abi
will rewrite multi-value returns to store the values into thesret
return pointer at the proper offsets, and return thesret
return pointer in a register.This almost works! However, I'm running into a little trouble during register allocation.
Consider this function that returns four
i64
s:After legalization, we see that an
sret
parameter has been added, that all the returns are assigned to offsets from thesret
, and that we are storing the return values at those offsets just before returning. Great!However, when register allocation comes along, it unnecessarily inserts an instruction to move
v0
from%rax
to%rsi
. I'm a bit confused why this is happening, since I update theValueLocation
to an offset from thesret
, and it shouldn't even think thatv0
is live in%rax
! Anyways, even if this unnecessary instruction were included, things would otherwise work, except that the register allocator also recorded the new location ofv0
as%rsi
, which means that now the verifier complains that teh arguments are in the wrong places!Any idea what might be going on and what I should do from here?
Full debug logs
One idea I was considering (but don't actually know if this is a good idea or just a kludgy workaround) was introducing a
sret_store
(orstore_copy
) instruction that would return a new copy of the value that is located at the stored location (similar tocopy
andspill
).Basically, this would turn the previous example into:
But I didn't really want to pursue this further without getting feedback from y'all.
Thanks!